Updating zero capacity resource semantics#4555
Updating zero capacity resource semantics#4555robertnishihara merged 11 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Should we checking for float precision errors here? Something like RAY_CHECK(capacity > 0 - std::numeric_limits<double>::epsilon())
There was a problem hiding this comment.
I think either way is fine. If it's not causing any issues right now, then we can leave it. @williamma12 is fixing this properly in a separate PR.
|
Can one of the admins verify this patch? |
|
Test FAILed. |
2aa7e80 to
b29bbcc
Compare
|
Test FAILed. |
raulchen
left a comment
There was a problem hiding this comment.
Thanks for this PR. Looks good to me overall. But @robertnishihara knows better about this code. I'll let him give the approval.
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
jenkins retest this please |
|
Test FAILed. |
There was a problem hiding this comment.
@richardliaw We're changing resource semantics - zero capacity resources are now not included in resource datastructures. For instance instead of returning {resource: 0} we now return an empty dictionary.
To make tests work, I've made some changes to resource handling in tune - can you please check if this looks okay?
There was a problem hiding this comment.
I pushed a small tweak - thanks for letting me know!
|
Test FAILed. |
|
jenkins retest this please |
|
Test FAILed. |
|
Test FAILed. |
|
@richardliaw @romilbhardwaj looks like the Jenkins tune test is still failing. |
|
Test FAILed. |
3fe4ff5 to
04d23b6
Compare
python/ray/tests/test_basic.py
Outdated
There was a problem hiding this comment.
This is probably fine, but 5 retries may not be enough in an environment like Travis, so let's keep an eye on whether this test becomes flaky or not and if it starts failing then increase this.
|
Test FAILed. |
Fix typo More fixes. Updates to python functions debug statements Rounding error fixes, removing cpu addition in cython and test fixes. linting Fix worker pool test python linting
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
79d46e0 to
74f647a
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
| public BaseTaskOptions(Map<String, Double> resources) { | ||
| for (Map.Entry<String, Double> entry : resources.entrySet()) { | ||
| if (entry.getValue().compareTo(0.0) <= 0) { | ||
| throw new RuntimeException(String.format("Resource capacity should be positive, " + |
There was a problem hiding this comment.
Let's use IllegalArgumentException here
| CallOptions callOptions3 = new CallOptions(ImmutableMap.of("CPU", 0.0)); | ||
| Assert.fail(); | ||
| } catch (RuntimeException e) { | ||
| // We should receive a RuntimeException indicate that we should pass a zero capacity resource. |
There was a problem hiding this comment.
| // We should receive a RuntimeException indicate that we should pass a zero capacity resource. | |
| // We should receive a RuntimeException indicates that we should not pass a zero capacity resource. |
|
Test FAILed. |
|
I believe that this commit may be breaking autoscaling. With a head node with num-cpus=1, this commit results in the autoscaler LoadMetrics reporting an empty dict for both static and dynamic resouces once 1 job is scheduled on the head node. This means that the head node never scales up. |
|
@ls-daniel thanks for reporting this. We'll look into it. |
This reverts commit 0f42f87.
This reverts commit 0f42f87.
This PR contains changes that help with memory issues ray-project#4555
What do these changes do?
This PR updates the resource quantity semantics. This PR onwards, a resource with capacity 0 implies it resource does not exist, and consequently no data structure must store a resource with capacity zero.
Linter
scripts/format.shto lint the changes in this PR.